Miniproject 3: Landing on the Moon

Introduction

Description

Traditionally, reinforcement learning has operated on "tabular" state spaces, e.g. "State 1", "State 2", "State 3" etc. However, many important and interesting reinforcement learning problems (like moving robot arms or playing Atari games) are based on either continuous or very high-dimensional state spaces (like robot joint angles or pixels). Deep neural networks constitute one method for learning a value function or policy from continuous and high-dimensional observations.

In this miniproject, you will teach an agent to play the Lunar Lander game from OpenAI Gym. The agent needs to learn how to land a lunar module safely on the surface of the moon (at coordinate [0,0]). The state space is 8-dimensional and (mostly) continuous, consisting of the X and Y coordinates of the lander, the X and Y velocity, the angle of the lander, the angular velocity, and two booleans indicating whether the left and right leg of the lander have landed on the moon.
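
To make the layout of the observation vector concrete, here is a minimal sketch that labels the eight entries in the order listed above. The field names and the helper `unpack_observation` are illustrative assumptions, not part of Gym.

```python
from collections import namedtuple

import numpy as np

# Field names are illustrative; the order follows the description above.
LanderState = namedtuple(
    "LanderState",
    ["x", "y", "vx", "vy", "angle", "angular_velocity", "left_leg", "right_leg"],
)


def unpack_observation(observation):
    """Convert an 8-element observation array into a named tuple."""
    values = np.asarray(observation, dtype=float)
    if values.shape != (8,):
        raise ValueError("expected an observation with 8 entries")
    return LanderState(*values)


# Made-up observation, for illustration only.
state = unpack_observation([0.1, 1.2, -0.3, -0.5, 0.05, 0.0, 0.0, 1.0])
print(state.y, bool(state.right_leg))
```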

The agent gets a reward of +100 for landing safely and -100 for crashing. In addition, it receives "shaping" rewards at every step. It receives positive rewards for moving closer to [0,0], decreasing in velocity, shifting to an upright angle and touching the lander legs on the moon. It receives negative rewards for moving away from the landing site, increasing in velocity, turning sideways, taking the lander legs off the moon and for using fuel (firing the thrusters). The largest reward it can receive on a single step is about 100 in magnitude (positive or negative). The best score an agent can achieve in an episode is about +250.

There are two versions of the task: one with discrete controls and one with continuous controls. In the discrete version, the agent can take one of four actions at each time step: [do nothing, fire engines left, fire engines right, fire engines down]. In the continuous version, the agent sets two continuous actions at each time step: the amount of engine thrust and the direction.

We will use Policy Gradient approaches to learn the task. In the previous miniprojects, the network generated a probability distribution over the outputs and was trained to maximize the probability of a specific target output given an observation. In Policy Gradient methods, the network generates a probability distribution over actions and is trained to maximize expected future rewards given an observation.

Prerequisites

  • If using Docker, download the latest version of the image. Otherwise, you should have a running installation of TensorFlow, Keras, OpenAI Gym and Box2D.
  • You should know the concepts of "policy", "policy gradient", "REINFORCE" and "REINFORCE with baseline". If you want to start and haven't seen this yet in class, read Sutton & Barto (2018) Chapter 13 (13.1-13.4 and 13.7).

What you will learn

  • You will learn how to implement a policy gradient neural network using the REINFORCE algorithm.
  • You will learn how to implement baselines, including a learned value network.
  • You will learn how to adapt your network for both discrete and continuous control.

Notes

  • Reinforcement learning is noisy! Normally one should average over multiple random seeds with the same parameters to really see the impact of a change to the model, but we won't do this due to time constraints. However, you should be able to see learning over time with every approach. If you don't see any improvement, or very unstable learning, double-check your model and try adjusting the learning rate.

  • You may sometimes see "AssertionError: IsLocked() = False" after restarting your code. To fix this, reinitialize the environments by running the Gym Setup code below.

  • You will not be marked on the episode movies. If your notebook file is large before uploading, delete them.
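
The first note mentions averaging over random seeds. As a rough sketch of what that could look like (this miniproject does not require it), the hypothetical helper below takes one reward curve per seed and returns the per-episode mean and standard error. The reward values are made up for illustration.

```python
import numpy as np


def summarize_runs(rewards_per_seed):
    """Return the per-episode mean and standard error across seeds.

    rewards_per_seed: array-like of shape (n_seeds, n_episodes).
    """
    runs = np.asarray(rewards_per_seed, dtype=float)
    mean = runs.mean(axis=0)
    # Standard error of the mean; ddof=1 gives the sample standard deviation.
    sem = runs.std(axis=0, ddof=1) / np.sqrt(runs.shape[0])
    return mean, sem


# Made-up rewards for 3 seeds and 4 episodes.
runs = [[-300, -250, -100, 0],
        [-280, -260, -150, 20],
        [-320, -240, -50, 40]]
mean, sem = summarize_runs(runs)
print(mean, sem)
```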

Evaluation criteria

The miniproject is marked out of 15, with a further mark breakdown in each question:

  • Exercise 1: 5 points
  • Exercise 2: 2 points
  • Exercise 3: 3 points
  • Exercise 4: 5 points

We may perform random tests of your code but will not rerun the whole notebook.

In [1]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

Your Names

Before you start, please enter your sciper number(s) in the field below; they are used to load the data. The variable student_2 may remain empty, if you work alone.

In [2]:
sciper = {'student_1': 292070, 
         } #'student_2': 217033}
seed = sciper['student_1']#+sciper['student_2']

Setup

Dependencies and constants

In [3]:
import gym
import numpy as np
import matplotlib.pyplot as plt
import logging
from matplotlib.animation import FuncAnimation
from IPython.display import HTML, clear_output
from gym.envs.box2d.lunar_lander import heuristic

import keras
import tensorflow as tf
from tensorflow.contrib.distributions import Beta
from keras.models import Sequential, Model, model_from_json, load_model
from keras.layers import Dense, Lambda, Input, Dropout
from keras.optimizers import Adam
from keras import backend as K

import time, datetime
import dill
np.random.seed(seed)
tf.set_random_seed(seed*2)
Using TensorFlow backend.
In [4]:
import sys
import resource
print (resource.getrlimit(resource.RLIMIT_STACK))
print (sys.getrecursionlimit())

max_rec = 0x200000

# May segfault without this line. 0x100 is a guess at the size of each stack frame.
resource.setrlimit(resource.RLIMIT_STACK, [0x100 * max_rec, resource.RLIM_INFINITY])
sys.setrecursionlimit(max_rec)
(8388608, -1)
3000

Gym Setup

Here we load the Reinforcement Learning environments from Gym (both the continuous and discrete versions).

We limit each episode to 500 steps so that we can train faster.

In [5]:
gym.logger.setLevel(logging.ERROR)
discrete_env = gym.make('LunarLander-v2')
discrete_env._max_episode_steps = 500
discrete_env.seed(seed*3)
continuous_env = gym.make('LunarLanderContinuous-v2')
continuous_env._max_episode_steps = 500
continuous_env.seed(seed*4)
gym.logger.setLevel(logging.WARN)

%matplotlib inline
plt.rcParams['figure.figsize'] = 12, 8
plt.rcParams["animation.html"] = "jshtml"

Utilities

We include a function that lets you visualize an "episode" (i.e. a series of observations resulting from the actions that the agent took in the environment).

We will also use the "Results" class (a wrapper around a Python dictionary) to store, save, load and plot your results. You can save your results to disk with results.save('filename') and reload them with Results(filename='filename.npz'); np.savez appends the .npz extension when the filename does not already have it. Use results.pop(experiment_name) to delete an old experiment.
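
The save and reload round trip relies on np.savez and np.load. The sketch below reproduces it with a plain dictionary and a temporary directory, so it runs without the Results class; the helper name `roundtrip` and the reward values are assumptions for illustration. Note that np.savez adds the .npz extension when it is missing, so the file to reload is the saved name plus .npz.

```python
import os
import tempfile

import numpy as np


def roundtrip(results, directory):
    """Save a dict of arrays with np.savez and read it back as a dict."""
    path = os.path.join(directory, "results")
    np.savez(path, **results)               # writes "results.npz"
    with np.load(path + ".npz") as data:    # the extension is needed when loading
        return {key: data[key] for key in data.files}


with tempfile.TemporaryDirectory() as tmp:
    loaded = roundtrip({"REINFORCE": np.array([-200.0, -150.0, -90.0])}, tmp)

print(loaded["REINFORCE"])
```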

In [6]:
def AddValue(output_size, value):
    return Lambda(lambda x: x + value, output_shape=(output_size,))

def render(episode, env):
    
    fig = plt.figure()
    img = plt.imshow(env.render(mode='rgb_array'))
    plt.axis('off')

    def animate(i):
        img.set_data(episode[i])
        return img,

    anim = FuncAnimation(fig, animate, frames=len(episode), interval=24, blit=True)
    html = HTML(anim.to_jshtml())
    
    plt.close(fig)
    !rm None0000000.png
    
    return html

class Results(dict):
    
    def __init__(self, *args, **kwargs):
        if 'filename' in kwargs:
            data = np.load(kwargs['filename'])
            super().__init__(data)
        else:
            super().__init__(*args, **kwargs)
        self.new_key = None
        self.plot_keys = None
        self.ylim = None
        
    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self.new_key = key

    def plot(self, window):
        clear_output(wait=True)
        for key in self:
            #Ensure latest results are plotted on top
            if self.plot_keys is not None and key not in self.plot_keys:
                continue
            elif key == self.new_key:
                continue
            self.plot_smooth(key, window)
        if self.new_key is not None:
            self.plot_smooth(self.new_key, window)
        plt.xlabel('Episode')
        plt.ylabel('Reward')
        plt.legend(loc='lower right')
        if self.ylim is not None:
            plt.ylim(self.ylim)
        plt.show()
        
    def plot_smooth(self, key, window):
        if len(self[key]) == 0:
            plt.plot([], [], label=key)
            return None
        y = np.convolve(self[key], np.ones((window,))/window, mode='valid')
        x = np.linspace(window/2, len(self[key]) - window/2, len(y))
        plt.plot(x, y, label=key)
        
    def save(self, filename='results'):
        np.savez(filename, **self)
In [7]:
# my utilities
def save_model(model, name):
    model.save("obj/" + name + ".h5")

def load_my_model(name):
    model = keras.models.load_model("obj/" + name + ".h5")
    return model



def get_agent_reward(name):
    model_policy = load_my_model(name+ '_policy')
    
    try:
        model_baseline = load_my_model(name+ '_baseline')
    except OSError:
        with open("obj/" + name + ".pkl", 'rb') as f:
            recurrent_agent = dill.load(f)
            recurrent_agent.update_model(model_policy)
            rewards = dill.load(f)
        return recurrent_agent, rewards
    else:
        with open("obj/" + name + ".pkl", 'rb') as f:
            recurrent_agent = dill.load(f)
            recurrent_agent.update_model(model_policy, model_baseline)
            rewards = dill.load(f)
        return recurrent_agent, rewards

    
def save_agent_reward(recurrent_agent, rewards, name):
    with open("obj/" + name + ".pkl", 'wb') as f:
        dill.dump(recurrent_agent, f)
        dill.dump(rewards, f)

    save_model(recurrent_agent.model_policy, name + '_policy')
    print('model baseline:', recurrent_agent.model_baseline)
    if recurrent_agent.model_baseline:
        print('Saving model baseline!')
        save_model(recurrent_agent.model_baseline, name + '_baseline')

Test runs

To get an idea of how the environment works, we'll plot an episode resulting from random actions at each point in time, and a "perfect" episode using a specially-designed function to land safely within the yellow flags.

Remove these plots before submitting the miniproject, to reduce the file size.

In [8]:
def run_fixed_episode(env, policy):
    frames = []
    observation = env.reset()
    done = False
    while not done:
        frames.append(env.render(mode='rgb_array'))
        action = policy(env, observation)
        observation, reward, done, info = env.step(action)
    return frames
    
def random_policy(env, observation):
    return env.action_space.sample()

def heuristic_policy(env, observation):
    return heuristic(env.unwrapped, observation)
In [8]:
episode = run_fixed_episode(discrete_env, random_policy)
render(episode, discrete_env)
Out[8]:


In [9]:
episode = run_fixed_episode(discrete_env, heuristic_policy)
render(episode, discrete_env)
Out[9]:



Experiment Loop

This is the method we will call to set up an experiment. Reinforcement learning usually operates on an Observe-Decide-Act cycle, as you can see below.

You don't need to add anything here; you will be working directly on the RL agent.

In [9]:
num_episodes = 30

def run_experiment(RLAgent_es, experiment_name, env, num_episodes, learning_rate=0.001, baseline=None, old_params=None, graph=True):
    
    rewards = []
    startin_reward_len = 0
    #Initiate the learning agent
    if old_params:
        agent = old_params[0]
        rewards = old_params[1]
        startin_reward_len = len(rewards)
    else:
        agent = RLAgent_es(n_obs = env.observation_space.shape[0], action_space = env.action_space,
                    learning_rate = learning_rate, discount=0.9, baseline = baseline)
    
    all_episode_frames = []
    step = 0
    for episode in range(1, num_episodes+1):
    
        #Update results plot and occasionally store an episode movie
        episode_frames = None
        if episode % 10 == 0:
            results[experiment_name] = np.array(rewards)
            if graph:
                results.plot(10)
        if episode % 500 == 0 or episode == num_episodes:
            episode_frames = []
            
        #Reset the environment to a new episode
        observation = env.reset()
        episode_reward = 0

        while True: # in every episode there is a full trip to the moon
        
            if episode_frames is not None:
                episode_frames.append(env.render(mode='rgb_array'))

            # 1. Decide on an action based on the observations
            action = agent.decide(observation)

            # 2. Take action in the environment
            next_observation, reward, done, info = env.step(action)
            episode_reward += reward

            # 3. Store the information returned from the environment for training
            agent.observe(observation, action, reward)

            # 4. When we reach a terminal state ("done"), use the observed episode to train the network
            if done:
                rewards.append(episode_reward)
                print('episode number:', episode + startin_reward_len, 'reward:', episode_reward)
                if not graph:
                    print('episode number:', episode + startin_reward_len, 'reward:', episode_reward)
                if episode_frames is not None:
                    all_episode_frames.append(episode_frames)
                agent.train() # in this way I'm training every episode, at the end of the episode!
                break

            # Reset for next step
            observation = next_observation
            step += 1
            
    return all_episode_frames, agent, rewards

The Agent

Here we give the outline of a python class that will represent the reinforcement learning agent (along with its decision-making network). We'll modify this class to add additional methods and functionality throughout the course of the miniproject.

NOTE: We have set up this class to implement new functionality as we go along using keyword arguments. If you prefer, you can instead subclass RLAgent for each question.

In [10]:
class RLAgent(object):
    
    def __init__(self, n_obs, action_space, learning_rate, discount, baseline = None):

        #We need the state and action dimensions to build the network
        self.n_obs = n_obs
        #We'll treat the continuous case a bit differently
        self.continuous = 'Discrete' not in str(action_space)
        if self.continuous:
            self.n_act = action_space.shape[0]
            self.act_low = action_space.low
            self.act_range = action_space.high - action_space.low
        else:
            self.n_act = action_space.n
        self.lr = learning_rate
        self.gamma = discount
        
        self.moving_baseline = None
        self.use_baseline = False
        self.use_adaptive_baseline = False
        if baseline == 'adaptive':
            self.use_baseline = True
            self.use_adaptive_baseline = True
        elif baseline == 'simple':
            self.use_baseline = True

        #These lists store the cumulative observations for this episode
        self.episode_observations, self.episode_actions, self.episode_rewards = [], [], []
        self.model_policy = None
        self.model_baseline = None
        #Build the keras network
        self._build_network()

    def observe(self, state, action, reward):
        """ This function takes the observations the agent received from the environment and stores them
            in the lists above. If necessary, preprocess the action here for the network. You may also get 
            better results clipping or normalizing the reward to limit its range for training."""
        raise NotImplementedError
        
    def decide(self, state):
        """ This function feeds the observed state to the network, which returns a distribution
            over possible actions. Sample an action from the distribution and return it."""
        raise NotImplementedError

    def train(self):
        """ When this function is called, the accumulated observations, actions and discounted rewards from the
            current episode should be fed into the network and used for training. Use the _get_returns function 
            to first turn the episode rewards into discounted returns. """
        raise NotImplementedError

    def _get_returns(self):
        """ This function should process self.episode_rewards and return the discounted episode returns
            at each step in the episode, then optionally apply a baseline. Hint: work backwards."""
        raise NotImplementedError

    def _build_network(self):
        """ This function should build the network that can then be called by decide and train. 
            The network takes observations as inputs and has a policy distribution as output."""
        raise NotImplementedError
    
    def update_model(self, model_policy, model_baseline = None):
        self.model_policy = model_policy
        print('Update model:', model_policy, model_baseline)
        if model_baseline:
            print('Loaded model baseline')
            self.model_baseline = model_baseline
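
The docstrings of "decide" and "_get_returns" describe two small computations: sampling an action from the predicted distribution, and turning the episode rewards into discounted returns by working backwards. The following is a minimal NumPy sketch of both, independent of Keras. The function names `sample_action` and `discounted_returns` are hypothetical, and the probabilities and rewards are made up.

```python
import numpy as np


def sample_action(probabilities, rng):
    """Draw an action index from a categorical distribution."""
    probabilities = np.asarray(probabilities, dtype=float)
    # Renormalize to guard against small floating-point drift in a softmax output.
    probabilities = probabilities / probabilities.sum()
    return int(rng.choice(len(probabilities), p=probabilities))


def discounted_returns(rewards, gamma):
    """Compute G_t = r_t + gamma * G_{t+1} by sweeping backwards over the episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


rng = np.random.RandomState(0)
action = sample_action([0.1, 0.2, 0.3, 0.4], rng)
returns = discounted_returns([1.0, 0.0, -2.0], gamma=0.5)
print(action, returns)
```

The backward sweep costs one pass over the episode, whereas recomputing the sum from scratch at every step costs a number of operations that grows quadratically with the episode length.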

Exercise 1: REINFORCE with simple baseline

Description

Implement the REINFORCE Policy Gradient algorithm using a deep neural network as a function approximator.

  1. Implement the "observe" method of the RLAgent above.
  2. Implement the "_build_network" method. Your network should take the 8-dimensional state space as input and output a softmax distribution over the 4 discrete actions. It should have 2-3 hidden layers with about 10-20 units each and ReLU activations. Use the REINFORCE loss function. HINT: Keras has a built-in "categorical cross-entropy" loss, and a "sample_weight" argument in fit/train_on_batch. Consider how these could be used together.
  3. Implement the "decide", "train" and "_get_returns" methods using the inputs and outputs of your network. In "_get_returns", implement a baseline based on a moving average of the returns; it should only be in effect when the agent is constructed with the "use_baseline" keyword. In "train", use train_on_batch to form a minibatch from all the experiences in an episode.
  4. Try a few learning rates and pick the best one (the default for Adam is a good place to start). Run the functions below and include the resulting plots, with and without the baseline, for your chosen learning rate. Plot the last movie from the baseline results.
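
The hint in step 2 can be checked numerically. With one-hot targets for the actions taken, the categorical cross-entropy of a step equals -log pi(a|s), so weighting each step by its return gives the REINFORCE loss. The sketch below verifies this in NumPy with made-up probabilities; the function names are hypothetical, and the way Keras normalizes weighted losses may differ in detail between versions.

```python
import numpy as np


def weighted_cross_entropy(probs, actions, returns):
    """Mean over steps of return-weighted cross-entropy with one-hot action targets."""
    probs = np.asarray(probs, dtype=float)
    one_hot = np.eye(probs.shape[1])[actions]
    cross_entropy = -np.sum(one_hot * np.log(probs), axis=1)
    return np.mean(np.asarray(returns) * cross_entropy)


def reinforce_loss(probs, actions, returns):
    """Mean over steps of -G_t * log pi(a_t | s_t)."""
    probs = np.asarray(probs, dtype=float)
    log_prob_taken = np.log(probs[np.arange(len(actions)), actions])
    return -np.mean(np.asarray(returns) * log_prob_taken)


# Made-up policy outputs, actions and returns for three steps.
probs = [[0.7, 0.1, 0.1, 0.1],
         [0.25, 0.25, 0.25, 0.25],
         [0.1, 0.6, 0.2, 0.1]]
actions = [0, 3, 1]
returns = [1.0, -0.5, 2.0]

print(weighted_cross_entropy(probs, actions, returns))
print(reinforce_loss(probs, actions, returns))
```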

WARNING: Running any experiments with the same names (first argument in run_experiment) will cause your results to be overwritten.

Mark breakdown: 5 points total

  • 3 points for implementing and plotting basic REINFORCE with reasonable performance (i.e. a positive score).
  • 2 points for implementing and plotting the simple baseline with reasonable performance.
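
One possible reading of the "simple baseline" is an exponential moving average of the mean return, updated once per episode and subtracted from every return in that episode. The class below is a sketch under that assumption; the name `MovingAverageBaseline` and the `decay` parameter are hypothetical, and other moving-average schemes would fit the exercise as well.

```python
import numpy as np


class MovingAverageBaseline:
    """Exponential moving average of the mean episode return (one possible design)."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.value = None

    def apply(self, returns):
        """Update the average with this episode's returns and subtract it."""
        returns = np.asarray(returns, dtype=float)
        episode_mean = returns.mean()
        if self.value is None:
            self.value = episode_mean
        else:
            self.value = self.decay * self.value + (1 - self.decay) * episode_mean
        return returns - self.value


baseline = MovingAverageBaseline(decay=0.5)
first = baseline.apply([2.0, 4.0])    # baseline is 3.0
second = baseline.apply([6.0, 8.0])   # baseline is 0.5 * 3.0 + 0.5 * 7.0 = 5.0
print(first, second)
```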

Solution

In [27]:
class RLAgent_Ex1(RLAgent):
    
    def __init__(self, n_obs, action_space, learning_rate, discount, baseline = None):
        super(RLAgent_Ex1, self).__init__(n_obs, action_space, learning_rate, discount, baseline)
        self.epsilon = 0.1
        print('RLAgent 1')
        
    def observe(self, state, action, reward):
        """ This function takes the observations the agent received from the environment and stores them
            in the lists above. If necessary, preprocess the action here for the network. You may also get 
            better results clipping or normalizing the reward to limit its range for training."""
        self.episode_observations.append(state)
        self.episode_actions.append(action)
        self.episode_rewards.append(reward)
        
    def decide(self, state):
        """ This function feeds the observed state to the network, which returns a distribution
            over possible actions. Sample an action from the distribution and return it."""
        state = np.expand_dims(state, axis=0)
        
        actions = self.model_policy.predict(state)[0]
        # Greedy choice: returns the most probable action rather than a sample from the distribution
        return actions.argmax()
    
    def train(self):
        """ When this function is called, the accumulated observations, actions and discounted rewards from the
            current episode should be fed into the network and used for training. Use the _get_returns function 
            to first turn the episode rewards into discounted returns. """
            
        episode_steps = len(self.episode_observations)
        num_actions = 4
        inputs = np.asarray(self.episode_observations)
        targets = np.zeros((episode_steps, 1))
        moving_avarage_value, moving_avarage_index = [], 0
        
        print((self.episode_rewards[-3:]), sum(self.episode_rewards))
        print('actions:', self.episode_actions)
        
        for t in range(episode_steps):
            G = 0  # discounted returns
            for k in range(t, episode_steps):
                G += pow(self.gamma, k - t) * self.episode_rewards[k]
            
            if self.use_adaptive_baseline:
                state_reshaped = np.expand_dims(self.episode_observations[t], axis=0)
                G_reshaped = np.expand_dims(G, axis=0)
                _, _ = self.model_baseline.train_on_batch(state_reshaped, G_reshaped)
                adaptive_baseline = self.model_baseline.predict(state_reshaped)
                #print('adaptive baseline:', G, adaptive_baseline, G - adaptive_baseline)
                G = G - adaptive_baseline
            elif self.use_baseline:
                avg_period = 20
                if moving_avarage_index < avg_period:
                    moving_avarage_index += 1
                    moving_avarage_value.append(G)
                else:
                    moving_avarage_value.pop(0)
                    moving_avarage_value.append(G)
                #print('before baseline:', G,(sum(moving_avarage_value)) / moving_avarage_index)
                G = G - (sum(moving_avarage_value)) / moving_avarage_index

            targets[t] = pow(self.gamma, t) * G #, int(self.episode_actions[t])
            
#         print(inputs)
#         print(targets)
        loss, _ = self.model_policy.train_on_batch(inputs, targets)
#         print(loss)
        #These lists store the cumulative observations for this episode
        self.episode_observations, self.episode_actions, self.episode_rewards = [], [], []

    def _get_returns(self):
        """ This function should process self.episode_rewards and return the discounted episode returns
            at each step in the episode, then optionally apply a baseline. Hint: work backwards."""

        

    def _build_network(self):
        """ This function should build the network that can then be called by decide and train. 
            The network takes observations as inputs and has a policy distribution as output """
        print(self.use_baseline, self.use_adaptive_baseline)
        optimizer_adam = Adam(lr= self.lr)
        model = Sequential()
        model.add(Dense(128, activation='relu', input_dim=8))
        model.add(Dropout(0.4))
        model.add(Dense(64, activation='relu'))
        model.add(Dense(32, activation='relu'))
        model.add(Dense(12, activation='relu'))
        model.add(Dense(4, activation='softmax'))
        model.compile(optimizer=optimizer_adam,
                      loss=REINFORCE, metrics=['acc'])
        self.model_policy = model
        
        if self.use_adaptive_baseline:
            optimizer_adam = Adam(lr=self.lr)
            model = Sequential()
            model.add(Dense(32, activation='relu', input_dim=8))
            model.add(Dense(20, activation='relu'))
            model.add(Dense(10, activation='relu'))
            model.add(Dense(1))
            model.compile(optimizer=optimizer_adam,
                      loss='MSE',
                      metrics=['accuracy'])
            self.model_baseline = model
            
In [28]:
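def REINFORCE(target, output):
    # target: one discounted return per step, shape (batch, 1)
    # output: action probabilities from the policy network, shape (batch, 4)
    # The largest probability stands in for the probability of the action taken.
    reduced = tf.reduce_max(output, axis=-1)
    target = tf.reduce_max(target, axis=1)
    print(reduced)
    print('target:', target)
    a = tf.multiply(target, tf.log(reduced))
    a = tf.reduce_mean(a, axis=0)
    print(a) # printed for debugging
    # Negated because Keras minimizes the loss, while REINFORCE maximizes return * log-probability.
    return -a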

# Sum up losses instead of  mean
def categorical_crossentropy(target, output):
    _epsilon =  tf.convert_to_tensor(10e-8, dtype=output.dtype.base_dtype)
    output = tf.clip_by_value(output, _epsilon, 1. - _epsilon)  # selu
    return tf.reduce_sum(- tf.reduce_sum(target * tf.log(output),axis=len(output.get_shape()) - 1),axis=-1)

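def softmax(x):
    """Compute softmax values for each set of scores in x."""
    # Subtract the maximum before exponentiating to avoid overflow; the result is unchanged.
    e_x = np.exp(x - np.max(x, axis=0))
    return e_x / np.sum(e_x, axis=0)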
In [29]:
learning_rate = 0.001
In [30]:
#Supply a filename here to load results from disk
results = Results()
In [31]:
start_time = time.time()
name = 'REINFORCE'
#agent, rewards = get_agent_reward(name)
episodes, recurrent_agent, rewards= run_experiment(RLAgent_Ex1, name, discrete_env, 10, learning_rate, 
                                                   #old_params=(agent, rewards),
                                                   graph=True)
print('saving models..')
#save_agent_reward(recurrent_agent, rewards, name)

print('time:', datetime.timedelta(seconds=time.time() - start_time)) 
episode number: 10 reward: -978.9581667427512
[-11.441703435891013, -22.02014478966683, -100] -978.9581667427512
actions: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
saving models..
time: 0:00:02.898306
In [262]:
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot

SVG(model_to_dot(recurrent_agent.model_policy).create(prog='dot', format='svg'))
Out[262]:
[Model graph: dense_326_input (InputLayer) -> dense_326 (Dense) -> dropout_38 (Dropout) -> dense_327 (Dense) -> dense_328 (Dense) -> dense_329 (Dense) -> dense_330 (Dense)]
In [266]:
with tf.Session() as sess:
    writer = tf.summary.FileWriter('logs', sess.graph)
    writer.close()
In [175]:
render(episodes[-1], discrete_env)
Out[175]:


In [206]:
name = "REINFORCE (with baseline)"
#agent, rewards = get_agent_reward(name)
episodes, recurrent_agent, rewards= run_experiment(RLAgent_Ex1, name, discrete_env, 2, learning_rate=0.001, 
                                                   baseline='simple',
                                                   #old_params=(agent, rewards),
                                                   graph=True)


print('saving models..')
#save_agent_reward(recurrent_agent, rewards, name)
True False
Tensor("loss_65/dense_295_loss/Max:0", shape=(?,), dtype=float32)
target: Tensor("dense_295_target:0", shape=(?, ?), dtype=float32)
Tensor("loss_65/dense_295_loss/Mul:0", shape=(?, ?), dtype=float32)
RLAgent 1
episode number: 1 reward: -733.155138225624
[-8.045079427344062, -15.810721739950877, -100] -733.155138225624
actions: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
episode number: 2 reward: -571.955484032686
[-15.00097705499732, -15.363734962950188, -100] -571.955484032686
actions: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
saving models..
In [198]:
print(rewards[-1])
render(episodes[-1], discrete_env)
-196.85774997837026
Out[198]:



Exercise 2: Adaptive baseline

Description

Add a second neural network to your model that learns an observation-dependent adaptive baseline and subtracts it from your discounted returns, to reduce variance in learning.

  1. Modify the "_build_network" function of RLAgent to create a second "value network" when "adaptive" is passed for the baseline argument. The value network should have the same or similar structure as the policy network, without the softmax at the output.
  2. Subtract the simple baseline from the discounted returns as you did above to get the adjusted returns R - b.
  3. In addition to training your policy network, train the value network on the Mean-Squared Error compared to the adjusted returns.
  4. Train your policy network on R - b - b(s), i.e. the adjusted returns minus the adaptive baseline (the output of the value network).
  5. Try a few learning rates and plot all your best results together (without baseline, with simple baseline, with adaptive baseline). You may or may not be able to improve on the simple baseline! Return the trained model to use it in the next exercise.
  6. (Optional, no influence on grade) Try giving the policy and value networks different learning rates to see if you can improve performance.
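
The data flow in steps 2 to 4 can be sketched without a neural network. In the example below a least-squares linear model stands in for the value network (an assumption made only to keep the sketch short and runnable), and the observations and returns are randomly generated. What matters is which quantity is used where: the value model is fitted to R - b, and the policy is trained on R - b - b(s).

```python
import numpy as np


def fit_linear_value(states, targets):
    """Least-squares fit of targets ~ states @ w + c, standing in for a value network."""
    design = np.hstack([states, np.ones((len(states), 1))])
    weights, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return lambda s: np.hstack([s, np.ones((len(s), 1))]) @ weights


rng = np.random.RandomState(0)
states = rng.normal(size=(50, 8))                   # made-up observations
returns = states[:, 1] * 10 + rng.normal(size=50)   # made-up discounted returns R

simple_baseline = returns.mean()                # b
adjusted = returns - simple_baseline            # R - b
value_fn = fit_linear_value(states, adjusted)   # b(s), fitted to R - b
advantages = adjusted - value_fn(states)        # R - b - b(s), used to train the policy

print(adjusted.var(), advantages.var())
```

In this toy setting the variance of the advantages is much smaller than the variance of the adjusted returns, which is the effect the adaptive baseline is meant to have.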

TECHNICAL NOTE: Some textbooks may refer to this approach as "Actor-Critic", where the policy network is the "Actor" and the value network is the "Critic". Sutton and Barto (2018) suggest that Actor-Critic only applies when the discounted returns are bootstrapped from the value network output, as you saw in class. This can introduce instability in learning that needs to be addressed with more advanced techniques, so we won't use it for this miniproject. You can read more about state-of-the-art Actor-Critic approaches here: https://arxiv.org/pdf/1602.01783.pdf
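
To illustrate the distinction drawn in the technical note, the sketch below contrasts the Monte Carlo returns used in this miniproject with one-step bootstrapped targets of the form r + gamma * V(s'). The critic outputs are made-up numbers and the function names are hypothetical.

```python
import numpy as np


def monte_carlo_returns(rewards, gamma):
    """Full discounted returns computed from the observed rewards only."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def one_step_targets(rewards, next_values, gamma):
    """Bootstrapped targets r_t + gamma * V(s_{t+1}); V of the terminal state is 0."""
    return np.asarray(rewards) + gamma * np.asarray(next_values)


rewards = [1.0, 1.0, 1.0]
value_estimates = [2.0, 1.5, 0.5]            # made-up critic outputs V(s_0), V(s_1), V(s_2)
next_values = value_estimates[1:] + [0.0]    # shift by one step, terminal value 0

mc = monte_carlo_returns(rewards, gamma=0.9)
td = one_step_targets(rewards, next_values, gamma=0.9)
print(mc, td)
```

The bootstrapped targets depend on the critic's own estimates, which is the source of the instability mentioned in the note; the Monte Carlo returns do not.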

Mark breakdown: 2 points total

  • 2 points for implementing and plotting the adaptive baseline with the other two conditions, with reasonable performance (i.e. at least similar to the performance in Exercise 1).

Solution

In [264]:
name =  "REINFORCE (adaptive baseline)"
#agent, rewards = get_agent_reward(name)
episodes, adaptive_agent, rewards= run_experiment(RLAgent_Ex1, name, discrete_env, 10, learning_rate, 
                                                   baseline='adaptive',
                                                   #old_params=(agent, rewards),
                                                   graph=True)


print('saving models..')
#save_agent_reward(adaptive_agent, rewards, name)
episode number: 10 reward: -574.9208500662077
[-12.40505326787118, -12.959612999033906, -100] -574.9208500662077
actions: [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
saving models..
In [265]:
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot

SVG(model_to_dot(adaptive_agent.model_baseline).create(prog='dot', format='svg'))
Out[265]:
[Model graph: dense_356_input (InputLayer) -> dense_356 (Dense) -> dense_357 (Dense) -> dense_358 (Dense) -> dense_359 (Dense)]
In [134]:
render(episodes[-1], discrete_env)
Out[134]:



Exercise 3: Visualizing the Value Function

Description

Ideally, our value network should have learned to predict the relative values across the input space. We can test this by plotting the value prediction for different observations.

  1. Write a function to plot the value network prediction across [x,y] space for given (constant) values of the other state variables. X is always in [-1,1], and Y generally lies in [-0.2,1], where the landing pad is at [0,0].
  2. Plot the values for 3-4 combinations of the other 6 state variables, including [0,0,0,0,0,0]. The X and Y velocity are generally within [-1,1], the angle is in [-pi,pi] and the angular velocity lies roughly within [-3,3]. The last two inputs indicating whether the legs have touched the ground are 0 (False) or 1 (True). Use the same color bar limits across the graphs so that they can be compared easily.
  3. Answer the questions below in 1-2 sentences each.

Mark breakdown: 3 points total

  • 2 points for the plots of the value function.
  • 1 point for answering the questions below.

Solution

In [135]:
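def grid_creation(x_density, y_density):
    """Build a regular [x, y] grid and a flat (n_points, 2) array of its coordinates."""
    xs1 = np.linspace(-1, 1, num=x_density)    # x positions
    xs2 = np.linspace(-0.2, 1, num=y_density)  # y positions
    xx, yy = np.meshgrid(xs1, xs2)  # create the grid; both arrays have shape (y_density, x_density)
    # Flatten row by row, so that reshaping the predictions to (y_density, x_density)
    # puts every prediction back at its own grid point.
    ex = np.column_stack((xx.ravel(), yy.ravel()))
    print(ex.shape)
    return xx, yy, ex

def fill_input_tensor(grid_points, X, Y, wL, wV, pad0, pad1):
    """Build the 8-dimensional observations: the grid supplies the position, the other
       six state variables are held constant (X, Y: velocities; wL: angle;
       wV: angular velocity; pad0, pad1: leg contact flags)."""
    inputs = np.zeros((grid_points.shape[0], 8))
    inputs[:, 0] = grid_points[:,0]  # x position
    inputs[:, 1] = grid_points[:,1]  # y position
    inputs[:, 2] = X
    inputs[:, 3] = Y
    inputs[:, 4] = wL
    inputs[:, 5] = wV
    inputs[:, 6] = pad0
    inputs[:, 7] = pad1
    return inputs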
In [136]:
baseline_net = adaptive_agent.model_baseline
In [137]:
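x_density, y_density = 200, 120
xx, yy, grid_points = grid_creation(x_density, y_density)

# Each condition fixes the other six state variables:
# (X velocity, Y velocity, angle, angular velocity, left leg contact, right leg contact)
conditions = [(0, 0, 0, 0, 0, 0)]
for _ in range(3):
    conditions.append((np.random.uniform(-1, 1), np.random.uniform(-1, 1),
                       np.random.uniform(-np.pi, np.pi), np.random.uniform(-3, 3), 0, 0))

# Predict all value maps first so that every plot can share the same color limits
value_maps = []
for X, Y, wL, wV, pad0, pad1 in conditions:
    inputs = fill_input_tensor(grid_points, X, Y, wL, wV, pad0, pad1)
    predictions = np.squeeze(baseline_net.predict(inputs))
    value_maps.append(predictions.reshape((y_density, x_density)))
levels = np.linspace(min(v.min() for v in value_maps), max(v.max() for v in value_maps), 30)

for t, (value_map, (X, Y, wL, wV, pad0, pad1)) in enumerate(zip(value_maps, conditions)):
    plt.figure(t)
    plt.contourf(xx, yy, value_map, levels=levels, cmap=plt.cm.jet)
    plt.colorbar()
    plt.title('X:' + str(X) + ' Y:' + str(Y) + ' wL:' +  str(wL) + ' wV:' + str(wV) + ' pad0:' + str(pad0) + ' pad1:' + str(pad1))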
(24000, 2)

Question: Is there a combination of variables in the ranges above for which you see the highest rewards? Do they make sense?

Answer:

Question: What about outside of the ranges above? Why might these produce higher values?

Answer:

Question: Are the values higher before or after the legs touch the surface? Why?

Answer:

Exercise 4: Continuous action space

Description

One disadvantage of Q-learning-type approaches is that they require that the agent take discrete actions ("left", "right", "up", "down" etc.). In policy gradient, the agent learns a distribution over actions for each observation. That distribution can be either discrete (as we saw above) or continuous.

Here we will switch to continuous actions. The agent has a 2D action at each time step: a value in [-1,1] to control the amount of thrust, and a value in [-1,1] to control the left/right direction of the thrust. Since the output is bounded, we will model it with a Beta distribution: http://en.wikipedia.org/wiki/Beta_distribution.

A Beta distribution is defined by 2 parameters: alpha and beta. The network should output both for each action. We will ensure that alpha >= 1 and beta >= 1, which keeps the distribution unimodal and well-behaved. The agent then samples from a distribution defined by [alpha,beta] for each action and transforms the [0,1] output to [-1,1] to act.
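As a small illustration of the sampling step described above, the sketch below draws a value from a Beta distribution and rescales it from [0,1] to [-1,1]. It uses only the Python standard library. The helper names and the alpha/beta values are assumptions made for the example, not outputs of the assignment's network.

```python
import random


def sample_action(alpha, beta, rng=random):
    """Draw u ~ Beta(alpha, beta) on [0, 1] and rescale it to [-1, 1]."""
    u = rng.betavariate(alpha, beta)
    return 2.0 * u - 1.0


def beta_mode(alpha, beta):
    """Mode of Beta(alpha, beta) on [0, 1]; this formula requires alpha > 1 and beta > 1."""
    return (alpha - 1.0) / (alpha + beta - 2.0)


rng = random.Random(0)       # seeded generator so the run is repeatable
alpha, beta = 4.0, 2.0       # hypothetical network outputs for one action dimension
samples = [sample_action(alpha, beta, rng) for _ in range(5)]

# Most likely action after rescaling: the single peak of the distribution
peak_action = 2.0 * beta_mode(alpha, beta) - 1.0
```

With alpha > 1 and beta > 1 the density has a single interior peak, so the policy expresses one preferred action with some spread around it rather than piling probability mass onto the boundaries.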

Modify your model in the following ways when it detects that self.continuous is True:

  1. Your policy network should have 4 outputs: one alpha and one beta for each action. Use a softplus output, which ensures the output is >=0. Then use the AddValue function defined above to add 1 to each unit in the output layer.
  2. Create a custom Keras loss function to calculate the log probability of the episode actions taken under the policy. HINT: look at what you can do with the tensorflow.contrib.distributions.Beta module imported above.
  3. Adjust the other methods appropriately for the continuous case (use np.random.beta to make a decision).
  4. Rerun the agent for the continuous case with several different learning rates and either the simple or adaptive baseline, and plot an episode.
  5. Finally, adapt your function from Exercise 3 to plot the expected thrust and thrust direction across the XY space for your best model. Use several values of the other state variables, including [0,0,0,0,0,0]. Use the expectation of the Beta distribution. There should be 2 plots for each state variable condition: one for thrust and one for thrust direction.
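The quantities used in steps 1, 2 and 5 can be checked by hand with a few lines of plain Python. This is only a sketch under stated assumptions: the helper names are hypothetical, and the real loss has to be written with tensor operations so that Keras can differentiate it.

```python
import math


def softplus_plus_one(x):
    """softplus(x) + 1 = log(1 + e^x) + 1, which is always greater than 1."""
    return math.log1p(math.exp(x)) + 1.0   # adequate for moderate x


def beta_log_prob(u, alpha, beta):
    """Log density of Beta(alpha, beta) at a point u strictly inside (0, 1)."""
    log_norm = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return (alpha - 1.0) * math.log(u) + (beta - 1.0) * math.log(1.0 - u) - log_norm


def expected_action(alpha, beta):
    """Mean of Beta(alpha, beta), rescaled from [0, 1] to [-1, 1]."""
    return 2.0 * alpha / (alpha + beta) - 1.0


raw_outputs = [0.0, 2.0]                    # hypothetical pre-activation values for one action
alpha, beta = (softplus_plus_one(z) for z in raw_outputs)

action = 0.2                                # action that was taken, in [-1, 1]
u = (action + 1.0) / 2.0                    # map back to [0, 1] before evaluating the density
log_p = beta_log_prob(u, alpha, beta)       # log probability of the action under the policy
mean_action = expected_action(alpha, beta)  # value to plot in step 5
```

For a policy gradient update, the per-step loss term would be the negative of log_p multiplied by the (baseline-adjusted) return, since the optimizer minimizes the loss.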

Mark breakdown: 5 points total

  • 3 points for plotting the results with continuous actions with several learning rates, with reasonable performance.
  • 2 points for plotting the policy in XY space under several conditions.

Solution

In [186]:
class RLAgent_Ex4(RLAgent):
    
    def __init__(self, n_obs, action_space, learning_rate, discount, baseline = None):
        super(RLAgent_Ex4, self).__init__(n_obs, action_space, learning_rate, discount, baseline)
        self.epsilon = 0.1
        print('RLAgent 4')
    def observe(self, state, action, reward):
        """ This function takes the observations the agent received from the environment and stores them
            in the lists above. If necessary, preprocess the action here for the network. You may also get 
            better results clipping or normalizing the reward to limit its range for training."""
        self.episode_observations.append(state)
        self.episode_actions.append(action)
        self.episode_rewards.append(reward)
        
    def decide(self, state):
        """ This function feeds the observed state to the network, which returns a distribution
            over possible actions. Sample an action from the distribution and return it."""
        state = np.expand_dims(state, axis=0)
        
        output = self.model_policy.predict(state)[0]
        action1_value = np.random.beta(output[0], output[1])* 2 - 1
        action2_value = np.random.beta(output[2], output[3])* 2 - 1
        return np.array([action1_value, action2_value])
        
    def train(self):
        """ When this function is called, the accumulated observations, actions and discounted rewards from the
            current episode should be fed into the network and used for training. Use the _get_returns function 
            to first turn the episode rewards into discounted returns. """
        
        episode_steps = len(self.episode_observations)
        num_actions = 2
        inputs = np.asarray(self.episode_observations)
        # Each target row holds [action 1 in [0,1], action 2 in [0,1], discount-weighted return]
        targets = np.zeros((episode_steps, num_actions+1))
        moving_average_value, moving_average_index = [], 0
        #print(sum(self.episode_rewards))
        for t in range(episode_steps):
            G = 0  # discounted returns
            # Note: the sum starts at k = t+1, so self.episode_rewards[t] itself is not part of G
            for k in range(t+1, episode_steps):
                G += pow(self.gamma, k - t - 1) * self.episode_rewards[k]
            
            if self.use_adaptive_baseline:
                state_reshaped = np.expand_dims(self.episode_observations[t], axis=0)
                G_reshaped = np.expand_dims(G, axis=0)
                _, _ = self.model_baseline.train_on_batch(state_reshaped, G_reshaped)
                # Take the scalar out of the (1, 1) prediction so that G stays a plain number
                adaptive_baseline = self.model_baseline.predict(state_reshaped)[0, 0]
                #print('adaptive baseline:', G, adaptive_baseline, G - adaptive_baseline)
                G = G - adaptive_baseline
            elif self.use_baseline:
                # Simple baseline: moving average of the last avg_period returns in this episode
                avg_period = 20
                if moving_average_index < avg_period:
                    moving_average_index += 1
                    moving_average_value.append(G)
                else:
                    moving_average_value.pop(0)
                    moving_average_value.append(G)
                #print('before baseline:', G,(sum(moving_average_value)) / moving_average_index)
                G = G - (sum(moving_average_value)) / moving_average_index
            print(G)
            # Rescale the actions from [-1,1] back to [0,1], the support of the Beta distribution
            targets[t] = (self.episode_actions[t][0]+1)/2, (self.episode_actions[t][1]+1)/2, (pow(self.gamma, t) * G)
        loss = self.model_policy.train_on_batch(inputs, targets)
        print('loss:', loss)
        # These lists store the cumulative observations for this episode
        self.episode_observations, self.episode_actions, self.episode_rewards = [], [], []

    def _get_returns(self):
        """ This function should process self.episode_rewards and return the discounted episode returns
            at each step in the episode, then optionally apply a baseline. Hint: work backwards."""

        

    def _build_network(self):
        """ This function should build the network that can then be called by decide and train. 
            The network takes observations as inputs and has a policy distribution as output """
        print(self.use_baseline, self.use_adaptive_baseline)
        optimizer_adam = Adam(lr=self.lr)
        state_input = Input(shape=(8,))

        h1 = Dense(24, activation='relu')(state_input)
        h2 = Dense(48, activation='relu')(h1)
        h3 = Dense(24, activation='relu')(h2)
        # softplus keeps the four outputs positive; adding 1 gives alpha >= 1 and beta >= 1
        output = Dense(4, activation='softplus')(h3)
        final_output = Lambda(lambda x: x + 1, output_shape=(4,))(output)
        model = Model(inputs=state_input, outputs=final_output)
        # Use the learning rate passed to the agent rather than a hard-coded value
        model.compile(loss=beta_loss, optimizer=optimizer_adam)
        self.model_policy = model
        
        
        if self.use_adaptive_baseline:
            # Value network used as the adaptive baseline
            optimizer_adam = Adam(lr=self.lr)
            model = Sequential()
            model.add(Dense(32, activation='relu', input_dim=8))
            model.add(Dense(20, activation='relu'))
            model.add(Dense(10, activation='relu'))
            model.add(Dense(1))
            model.compile(optimizer=optimizer_adam,
                      loss='MSE',
                      metrics=['accuracy'])
            self.model_baseline = model

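def beta_loss(target, output):
    # The target holds the actions that were taken (rescaled to [0, 1]) and, in the last column,
    # the discount-weighted return that multiplies their log probability.
    # The output holds alpha and beta for each action (two pairs): find the probability of each
    # action under its own Beta distribution and take the log.
    action1_prob = Beta(output[:, 0], output[:, 1])
    action2_prob = Beta(output[:, 2], output[:, 3])
    a = action1_prob.log_prob(target[:, 0])
    b = action2_prob.log_prob(target[:, 1])
    # Keras minimizes the loss, so negate the return-weighted log probability
    # to perform gradient ascent on it.
    result = -tf.reduce_sum((a + b) * target[:, 2], axis=-1)
    return result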
    
In [187]:
learning_rates = [0.001]
c_models = []
results.plot_keys = []
for lr in learning_rates:
    experiment_name = ("Continuous REINFORCE (learning rate: %s)" % str(lr))
    results.plot_keys.append(experiment_name)
    episodes, model, rewards = run_experiment(RLAgent_Ex4, experiment_name, continuous_env, 49, lr, baseline='simple')
    c_models.append(model)
[per-step output of print(G) omitted for episodes 40-48]
episode number: 40 reward: -483.91169709034375
loss: -3.1311297
episode number: 41 reward: -349.4912447028687
loss: -1.5967448
episode number: 42 reward: -603.7059565899842
loss: -1.7600715
episode number: 43 reward: -418.02143071478804
loss: 0.8684666
episode number: 44 reward: -275.8607389379355
loss: 0.530601
episode number: 45 reward: -338.9235710407508
loss: 1.626268
episode number: 46 reward: -162.45310914263735
loss: 2.0260181
episode number: 47 reward: -173.86562896532195
loss: -0.4358334
episode number: 48 reward: -570.8789188242323
loss: -2.333565
episode number: 49 reward: -334.27100030301557
0.0
-0.2983492966833099
-1.4439401108189527
-1.4870317958165575
-1.9699633040925306
-2.718226367998306
-2.7083729133539345
-4.846822609437455
-5.397203234832504
-4.1067005551555145
-1.861262871441367
-0.08173170692575127
-1.317836233662506
0.23496811240338467
1.9167332291032837
1.2797865673162878
3.0850710904442726
1.6174130454506166
1.3744549200443572
-0.293424391608303
-0.6857071038338152
-2.234619621988907
-4.257351060398495
-4.725836853188737
-3.6302490254121356
-3.001160287392876
-1.768588309523551
-3.8454744761745925
-2.9808039261573267
-2.4618896649212143
-1.3534399736796576
-0.069027329989785
-0.943624251529263
0.5093338012583768
-0.8208963606678878
0.6553814609575088
2.3149914400227125
3.613129882612495
4.194408966592597
1.7845823081990844
0.8082743752507682
2.4118524105176604
-0.4123201822166003
-1.0489952394201865
-5.013019956698782
-4.512021567773017
-3.9336058266802336
-4.9890707641085985
-4.495571434698206
-4.8915651750718645
-3.735480673724439
-2.8899321756344847
-4.107253541918274
-2.661438963719652
-0.9336643407032366
-1.7288828077750829
-0.7651186883699683
-2.469750136644725
-2.4623015707217757
-3.640638912873161
-3.285065134769482
-2.022601385060817
-1.1359683273159096
-0.7616206742647869
0.7208060086026293
-0.1368929581411784
-1.7035062851328533
-2.5862637836659452
-1.723402235905258
-2.511098726143146
-3.6856133810539013
-2.9329003159783404
-6.179494443918124
-5.950025669981972
-7.239297858339345
-8.414682880046268
-8.138082065690568
-7.228553130392875
-7.121225426548808
-6.249729872770743
-7.48606772468802
-6.826918129935407
-5.7480661801696264
-5.220597446911249
-4.6225012612713385
-5.424163739793247
-5.039884601658528
-5.614447224967364
-5.573810806301413
-6.842545911200055
-6.986757456374686
-7.445681062152506
-7.727592956109266
-8.210407978851617
-8.867661148883613
-9.103974709919981
-10.366404814158393
-11.123893809038826
-12.522577610803367
-12.507246652916926
-13.069876135725714
-13.171946929593716
-13.448591768268287
-15.530315848737487
-17.2490570195977
-17.915530733696485
-16.22718562832405
-14.055101877203747
-13.761448632585832
-13.916160936040036
-15.186945302536778
-15.059394565345926
-16.49872761617825
-18.07198398121639
-18.01790650056914
-30.30567002336563
-32.364085250151696
-51.62280687324187
-48.427984023020414
-26.656922292045778
70.81892596500856
loss: -1.3747431
In [185]:
render(episodes[-1], continuous_env)
Out[185]:



Plot

For Your Interest

The code you've written above can easily be adapted to other Gym environments. If you like, try experimenting with different environments and network structures!
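When switching environments, the main things that usually change are the size of the observation vector and the type and size of the action space. The sketch below shows one possible way to read these off so that the network's input and output layers can be sized automatically. The helper name `network_dims` is hypothetical (it is not part of Gym or of this notebook), and the `SimpleNamespace` objects are stand-ins that mimic only the `shape` and `n` attributes of Gym spaces, so the example runs without Gym installed.

```python
from types import SimpleNamespace


def network_dims(observation_space, action_space):
    """Hypothetical helper: return (input_size, output_size, is_discrete)."""
    # Flatten the observation shape, e.g. (8,) -> 8 or (4, 2) -> 8.
    input_size = 1
    for dim in observation_space.shape:
        input_size *= dim

    # Discrete action spaces expose `n`; continuous (Box) spaces expose `shape`.
    if hasattr(action_space, "n"):
        return input_size, action_space.n, True

    output_size = 1
    for dim in action_space.shape:
        output_size *= dim
    return input_size, output_size, False


# Stand-ins mimicking the two Lunar Lander variants described in the introduction:
# 8-dimensional state, 4 discrete actions or 2 continuous actions.
discrete_like = network_dims(SimpleNamespace(shape=(8,)), SimpleNamespace(n=4))
continuous_like = network_dims(SimpleNamespace(shape=(8,)), SimpleNamespace(shape=(2,)))
print(discrete_like)    # (8, 4, True)
print(continuous_like)  # (8, 2, False)
```

With a real environment, the same call would presumably look like `network_dims(env.observation_space, env.action_space)`, where `env` comes from `gym.make(...)`. Environments with image observations or discrete observation spaces would likely need a different network front end than the one used here.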